At the start of the previous installment of this series, I noted that good, scalable database performance is largely dependent on putting the data in the right place, being able to access it efficiently, and avoiding unnecessary overheads. Having looked at ways of generating data of various types and patterns at reasonable volumes, the time has come to investigate how SQL Server handles the placing of data.
Pages and Extents
In Oracle, we talk about data segments. A simple object corresponds to a single data segment, each partition of a partitioned object (or sub-partition of a composite partitioned object) is a separate data segment; segments, in turn, comprise a collection of extents; and an extent is a contiguous set of blocks (pages) in a file.
I started my research into similar data structures in SQL Server. There is no such thing as a “segment” in SQL Server, but a search in Books Online for the word “Extents” proved to be a good choice. The first hit was an article on Understanding Pages and Extents, which was an excellent starting point for learning how SQL Server allocates space and stores data.
I won’t repeat the details of my journey through the manuals, but instead will summarize some of the highlights (with comments about Oracle in italics):
- A page is the smallest unit of storage and is a fixed 8KB
Oracle can use blocks of 2KB, 4KB, 8KB, 16KB or even, on some platforms, 32KB. A single database can use multiple block sizes. In most cases the default, and best choice, for a platform is 8KB, and mixing block sizes is rarely a good idea. - An extent is a collection of eight consecutive pages
Oracle allows for variable extent sizes, and has various strategies for specifying the extent sizes that should be used for an object or collection of objects. - Certain “mapping” pages are used to track the allocation status of other pages, providing information such as which extents have been allocated, the objects to which they have been allocated, which pages in which extents are used, and which pages are free. This leads to references to pages such as Global Allocation Map (GAM) pages, Shared Global Allocation Map (SGAM) pages and Index Allocation Maps (IAM) pages.
Oracle has two dramatically different ways for dealing with extent allocation (“dictionary managed” and locally managed”, and two ways of dealing with block usage (“freelist managed” and “bitmap managed”), but I won’t go into details. - A single extent may hold pages from several objects. This is a mechanism to avoid wasting space but is only used when an object is very small.
In Oracle, all the blocks in an extent belong to a single object, but a single extent can be as small as two blocks, and in the most recent version of Oracle an object doesn’t allocate an extent until you insert some data into it.
A couple of thoughts that follow on from the above observations: there is an “allocation unit” for space at the operating system level in Windows, which defaults to 4KB on my little laptop. Clearly it would make sense to match the size of the allocation unit to the size to the database page. In fact, since I also noticed that SQL Server does a read-ahead for an entire extent when possible, there is a case for creating an allocation unit of 16KB, 32KB, or 64KB (if these are possible) and making sure you align the database page boundaries with the O/S allocation unit boundaries. (On the down side, perhaps a larger extent size would result in a “read / fill / write” operation when SQL Server wanted to write a single 8KB block into a 32KB extent.)
Data Storage in Heaps and B-Trees
Moving on from extents to objects, just two clicks away at Table and Index Organization, we find that:
- A table is always partitioned, although it may (and probably usually does) have only one partition.
- Each partition consists of many logical components which may be Heaps or (clustered) B-trees – the “HoBT”. In other words, someone using SQL Server is likely to think of a “table” as something which includes its indexes.
In Oracle, the separation of tables and indexes is more clear-cut, to the extent that a partitioned table may have local or global indexes. In other words, a “single partition” of an Oracle index may cover every partition of a partitioned table, or be partitioned using a different strategy from the table partitioning. - Each HoBT may hold three types of pages – “row data”, “LOB” or “row overflow”
In Oracle, overflow data may be “chained”, holding rows too long for a single block, or “migrated”, holding rows that have been updated and have to move because there is not enough space in the current block for the new length of the row. Nevertheless, “overflow” is not treated differently from any other row data. The LOB columns for a table are, like indexes, given their own dedicated segments.
With a little background, then, it’s time for some technical investigation. How well can I track where the data goes? In the previous article, I made a few comments about the differences in the amount of space used by Oracle and SQL Server for storing the same data; for the same million rows of data, a heap in Oracle used a about 8% more pages, while the non-clustered primary key index used about 7% less. Since data location is key to performance, I’m going to try and find out what they do differently.
We’ll start with the data creation script shown in Listing 1 (see the previous article for the reasons behind the structure of this script). The script is based on a template that I use for generating all sorts of volumes and patterns of data, but in this case creates only 5,000 rows in a heap table with a non-clustered non-unique index:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
CREATE TABLE test_table ( id INT , random_data INT , update_date DATE , vc_small VARCHAR(10) , vc_padding VARCHAR(100) ) ; go CREATE INDEX bt_i_rand ON test_table(random_data) ; go DECLARE @div INT = 50 ; DECLARE @mod INT = 100 ; DECLARE @limit INT = @div * @mod ; DECLARE @driver INT = 1000 ; WITH generator AS ( SELECT 1 AS id UNION ALL SELECT id + 1 FROM generator WHERE id < @driver ) INSERT INTO test_table SELECT id , ABS(xx % @mod) , NULL , NULL , REPLICATE('x', 100) FROM ( SELECT TOP ( @limit ) @driver * ( g1.id - 1 ) + g2.id id , CAST(NEWID() AS VARBINARY) xx FROM generator g1 CROSS JOIN generator g2 ) iv OPTION ( MAXRECURSION 0, FORCE ORDER ) ; |
Listing 1: Creating test_table and populating it with data
During my research into space allocation in SQL Server, I found a script that reported the space used by a table (and its indexes), by querying the sys.allocation_units view. I adapted this query for my own requirements, as shown in Listing 2 (the STR commands are there simply to make the output tidy in SQLCMD).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
SELECT SUBSTRING(tab.name, 1, 16) table_name , tab.object_id object_id , prt.index_id index_id , SUBSTRING(alu.type_desc, 1, 12) alloc_type , alu.data_space_id , STR(alu.total_pages, 8, 0) tot_pages , STR(alu.used_pages, 8, 0) used_pages , STR(alu.data_pages, 8, 0) data_pages FROM sys.schemas sch INNER JOIN sys.tables tab ON tab.schema_id = sch.schema_id INNER JOIN sys.partitions prt ON prt.object_id = tab.object_id INNER JOIN sys.allocation_units alu ON alu.container_id = prt.partition_id WHERE sch.name = 'DBO' ORDER BY tab.name , prt.partition_id , prt.index_id , alu.allocation_unit_id go |
Listing 2: Reporting on space allocation in test_table
After creating a new database (called testdata), and creating the one (heap) table and index shown in Listing 1, the output of the query was as follows:
1 2 3 4 |
table_name object_id index_id alloc_type data_space_id tot_pages used_pages data_pages ---------------- ----------- ----------- ------------ ------------- --------- ---------- --------- test_table 2105058535 0 IN_ROW_DATA 1 81 80 79 test_table 2105058535 2 IN_ROW_DATA 1 25 19 17 |
The index_id is zero for the heap table and 2 for the index. If I had created the index as a clustered index then there would be just one line in this report, with an index_id of 1.
We have 81 pages allocated to the heap table, of which 79 are used for data, and 25 pages allocated to its index, of which 17 are used for data.
In the previous article, I hinted that there appears to be no equivalent in heap tables for the “fill factor” that is an option for indexes (in Oracle, the PCTFREE parameter, which applies to tables and indexes, is effectively the same as “100 – fill factor”). Unfortunately, no-one took the hint so it’s time to find out the hard way how well filled are the table blocks, which means dumping the actual pages…
Investigating Table Blocks using DBCC IND and DBCC PAGE
I had to find a way of listing the pages allocated to the table, and then dump them. Tricky, but I had noticed that whenever you want to do anything subtle in SQL Server you need to use the DBCC command, so I tried a Google search for “DBCC PAGE”, and soon found a collection of interesting articles written by Paul Randall, the first of which gave me the syntax of both DBCC PAGE command, following which I soon found the “DBCC IND” command.
- DBCC IND( database, table, index_id )
- DBCC PAGE ( database, file_id, block_id, level)
I started by investigating the heap, with a call to DBCC IND, shown in Listing 3.
1 |
DBCC ind('testdata', 'test_table',0) |
Listing 3: Calling DBCC IND
Despite what may be suggested by the “ind”, this produces a list of the blocks in the table:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
PageFID PagePID IAMFID IAMPID ObjectID IndexID PartitionNumber PartitionID ------- ----------- ------ ----------- ----------- ----------- --------------- ----------------- 1 154 NULL NULL 2105058535 0 1 72057594038779904 1 153 1 154 2105058535 0 1 72057594038779904 1 157 1 154 2105058535 0 1 72057594038779904 1 158 1 154 2105058535 0 1 72057594038779904 iam_chain_type PageType IndexLevel NextPageFID NextPagePID PrevPageFID PrevPagePID -------------------- -------- ---------- ----------- ----------- ----------- ----------- In-row data 10 NULL 0 0 0 0 In-row data 1 0 0 0 0 0 In-row data 1 0 0 0 0 0 In-row data 1 0 0 0 0 0 |
This tells us that file 1, block 153 is the first data block in the table. So let’s dump it, as shown in Listing 4, and see what’s in it. Note that in SSMS, you will first need to turn on a trace flag (3604), using DBCC TRACEON (3604).
1 |
DBCC PAGE (testdata,1,153,3) |
Listing 4: Dumping page 153
The level 3 dump gives us a full symbolic dump, from which I’ve extracted a few lines:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
PAGE: (1:153) ...{page header was here} Slot 0 Offset 0x60 Length 124 Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Record Size = 124 Slot 0 Column 1 Offset 0x4 Length 4 Length (physical) 4 id = 1 Slot 0 Column 2 Offset 0x8 Length 4 Length (physical) 4 random_data = 32 Slot 0 Column 3 Offset 0x0 Length 0 Length (physical) 0 update_date = [NULL] Slot 0 Column 4 Offset 0x0 Length 0 Length (physical) 0 vc_small = [NULL] Slot 0 Column 5 Offset 0x18 Length 100 Length (physical) 100 vc_padding = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Slot 1 Offset 0xdc Length 124 Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Record Size = 124 ... Slot 63 Offset 0x1ee4 Length 124 Record Type = PRIMARY_RECORD Record Attributes = NULL_BITMAP VARIABLE_COLUMNS Record Size = 124 |
The block held 64 rows (slots 0 to 63) and each one was 124 bytes, for a total of 7,936. Add a couple of bytes for each pointer in the table of “Offsets”, and the total gets to 8,064. Add the block header (which I believe is 96 bytes) and there are only 32 bytes of free space in the block.
This is rather bad news if you’re hoping to update any of the data in the table. Unless there’s a “fill factor” for heap tables, this default 100% fill gives anyone using SQL Server a fairly compelling reason for using nothing but clustered indexes, unless there are some tables that aren’t going to need any updates.
To demonstrate the problem of updates, let’s perform the following simple update of test_table, as shown in Listing 5.
1 2 |
UPDATE test_table SET vc_small = 'xxxxxxxxxx' ; |
Listing 5: Updating the test_table table
This is what slot 3 (amongst others) looks like after I’ve executed the update and then re-run the level 3 dump command.
1 2 3 4 5 6 7 8 |
Slot 3 Offset 0x1f2 Length 9 Record Type = FORWARDING_STUB Record Attributes = Record Size = 9 Memory Dump @0x4176C1F2 00000000: 040e0100 00010008 00â â â â â â â â â â â â â â â â â ........ Forwarding to = file 1 page 270 slot 8 |
Slots 0, 1, and 2 show vc_small = xxxxxxxxxx, but between them the three rows have taken up 30 of the 32 bytes that I had estimated as free space, so there’s no room in the block for the update to slot 3 and the row has been copied to another block, leaving nothing but a pointer behind. The space freed up by this “row migration” allowed several more updates to take place in the block before it was full again, and I ended up with a total of 5 forwarding stubs in the block.
To an Oracle database designer, if there is no specific need to use a clustered index, to group data in a pattern that’s different from the order that reflects the natural order of arrival of the data, then a heap table is the obvious structure for the table. However, when you implement a heap table in SQL Server, any updates to the data will have a negative impact on performance because of the way that rows will move to different blocks. This, presumably, is one reason why there seems to be such enthusiasm for creating a clustered index on an identity column, as it keeps the data collected in order of arrival but allows you to leave some space in each block for updates by specifying an appropriate fill factor.
What about non-clustered indexes on heap tables? How do they cope with row movement? Here are the first few lines of output from using the DBCC IND command on our index:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
PageFID PagePID IAMFID IAMPID ObjectID IndexID PartitionNumber PartitionID ------- ----------- ------ ----------- ----------- ----------- --------------- ------------------ 1 156 NULL NULL 2105058535 2 1 72057594038845440 1 155 1 156 2105058535 2 1 72057594038845440 1 172 1 156 2105058535 2 1 72057594038845440 1 173 1 156 2105058535 2 1 72057594038845440 iam_chain_type PageType IndexLevel NextPageFID NextPagePID PrevPageFID PrevPagePID ------------------- -------- ---------- ----------- ----------- ----------- ----------- In-row data 10 NULL 0 0 0 0 In-row data 2 0 1 248 0 0 In-row data 2 1 0 0 0 0 In-row data 2 0 1 215 1 210 |
You’ll notice from the PagePID that the index and the table start off sharing the same extent (the table uses blocks 153, 154, 157, 158 and 159 and the index is using blocks 155 and 156).
Block 156 is the IAM page (PageType = 10) for the index, and block 172 is the root block (IndexLevel = 1, so it’s not a leaf block). Lets’ dump block 155, the first leaf block and take a look at the index entries:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
FileId PageId Row Level random_data (key) HEAP RID (key) KeyHashValue ------ ----------- ------ ------ ----------------- -------------- ---------------- 1 155 0 0 0 0x020100000100 (03003c52c4d8) 1 155 1 0 0 0x050100000100 (0600e2cd409d) ... 1 155 143 0 3 0x0D0100000100 (1100bb6857c6) 1 155 144 0 3 0x990000000100 (9d00b21bdb69) 1 155 145 0 3 0x9F0000000100 (a300191bb3a4) ... 1 155 307 0 6 0xBE0000000100 (c50036fde6b2) 1 155 308 0 6 0xBF0000000100 (c600ed093b03) (309 rows affected) |
I’ve shown the start, the end, and a patch in the middle, from the list of leaf entries. You’ll notice that each entry consists of key value (my random_data column), a HEAP RID (row identifier), which seems to be the block address for the key. Since the HEAP RID is also labeled as “key”, I assume that it has been appended to the real key value as an aid to ordering the appearance of duplicates; certainly the RIDs appear to be sorted within key. Interestingly, the byte-ordering of the HEAP RID means that it could, in principle, introduce a greater degree of random I/O than necessary – it looks as if rows with the same key in adjacent blocks may appear at non-adjacent positions in the index.
I’ve printed a section from the index which shows a few occurrences of the random_data value of 3 because it’s reporting one of the rows in the table block I dumped: the HEAP RID = 0x990000000100 matches file 1, block 153 (it looks like first 4 bytes is the block id, last two is the file id).
There’s a problem though: slot 47 of the table block is the slot that holds the value 3, and I can’t see the reference to slot 47 anywhere in the block dump of the leaf block. Does this mean that index look-ups are only accurate to the table block, leaving SQL Server to search the whole table block for the matching key? It seems unlikely, especially when you think of the possible consequences this would have for finding migrated rows.
Fortunately, by switching the dump level to 2 (raw dump row details) we can find the following in the file:
1 2 3 4 |
Slot 144, Offset 0x60, Length 16, DumpStyle BYTE Record Type = INDEX_RECORD Record Attributes = NULL_BITMAP Record Size = 16 Memory Dump @0x4146C060 00000000: 16030000 00990000 0001002f 00020000 â .........../.... |
Note the “2f”; converted to decimal, that’s the number 47 that we needed for the slot identifier. You might also note that this version of the dump doesn’t include anything that looks like the KeyHashValue. Just as in Oracle, some of the dump files report information that is derived at the time of the dump and some of the real information is missing from the dump.
Having tracked down the link from an index leaf to the table row, it takes just a few minutes to check that when a row is migrated from the heap table, the index entry is not updated; it points to the forwarding stub, rather than pointing to the new location of the row (Oracle adopts the same strategy).
Tackling one structure at a time is probably enough. I shall take a look at clustered indexes (unique and non-unique) in the next article.
Conclusions
Unless I’ve missed something obvious, the main conclusion that I’ve reached about SQL Server and heap tables is that you cannot specify the equivalent of a fill factor for a heap table. This means that updates to heap tables will probably result in an unreasonable amount of row movement and so leading to an unreasonable performance overhead when you query the data, because you will have to follow a forwarding link to find the row. The absence of a fill factor is (by itself) enough of a threat to make heap tables in SQL Server fairly undesirable.
Load comments